This notebook demonstrates using ML Workbench to create a machine learning model for text classification and set it up for online prediction. It is the "cloud run" version of the previous notebook: preprocessing, training, and batch prediction all run in the cloud using various Google Cloud services. Cloud runs are distributed, so they can handle very large datasets. With the small demo data there is little performance benefit, but the goal here is to demonstrate how ML Workbench's cloud run mode works.
There are only a few things that need to change between "local run" and "cloud run":

- The training and evaluation data must live in Google Cloud Storage instead of on local disk.
- Each command takes a --cloud flag to run as a cloud service.
- Each command accepts a cloud_config section for cloud-specific settings such as region and scale tier.

Other than this, nothing else changes from local to cloud!
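For example, the cloud analyze cell later in this notebook is just the local version with a --cloud flag and GCS paths. A minimal sketch of the local equivalent (the dataset name "newsgroup_data" and the local output path are assumptions carried over from the previous notebook):

%%ml analyze
output: ./data/analysis
data: newsgroup_data
features:
  news_label:
    transform: target
  text:
    transform: bag_of_words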
If you have any feedback, please send it to datalab-feedback@google.com.
In [2]:
# Make sure the processed data from the previous notebook is there.
!ls ./data
The MLWorkbench Magics are a set of Datalab commands that provide an easy, code-free experience for training, deploying, and predicting with ML models. This notebook takes the cleaned data from the previous notebook and builds a text classification model. The MLWorkbench Magics are a collection of magic commands, one for each step in the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model.
For details on any command, run it with --help.
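For example, the following cell prints the usage and arguments for the train command:

%%ml train --help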
This notebook shows the cloud version of every command, which is the normal experience when building models on large datasets. However, we will still use the small 20 newsgroups data.
In [4]:
!gsutil mb gs://datalab-mlworkbench-20newslab
In [5]:
!gsutil -m cp ./data/train.csv ./data/eval.csv gs://datalab-mlworkbench-20newslab
In [1]:
import google.datalab.contrib.mlworkbench.commands # This loads the '%%ml' magics
In [7]:
%%ml dataset create
name: newsgroup_data_gcs
format: csv
schema:
  - name: news_label
    type: STRING
  - name: text
    type: STRING
train: gs://datalab-mlworkbench-20newslab/train.csv
eval: gs://datalab-mlworkbench-20newslab/eval.csv
In [8]:
%%ml analyze --cloud
output: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs
features:
  news_label:
    transform: target
  text:
    transform: bag_of_words
In [ ]:
!gsutil -m rm -rf gs://datalab-mlworkbench-20newslab/transform # Delete previous results if any.
In [12]:
%%ml transform --cloud
output: gs://datalab-mlworkbench-20newslab/transform
analysis: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs
Click the links in the output cell to monitor the jobs' progress. Once they are complete (usually within 15 minutes, including job startup overhead), check the output.
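If you prefer the command line to the links, the transform jobs run on Cloud Dataflow, so you can also check their status with the Cloud SDK (this assumes gcloud is installed and authenticated for the project):

# List recent Dataflow jobs and their current states.
!gcloud dataflow jobs list --limit=5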
In [13]:
!gsutil ls gs://datalab-mlworkbench-20newslab/transform
In [2]:
%%ml dataset create
name: newsgroup_data_gcs_transformed
format: transformed
train: gs://datalab-mlworkbench-20newslab/transform/train-*
eval: gs://datalab-mlworkbench-20newslab/transform/eval-*
In [ ]:
# Training should use an empty output folder. So if you run training multiple times,
# use different folders or remove the output from the previous run.
!gsutil -m rm -fr gs://datalab-mlworkbench-20newslab/train
Note that "runtime_version: '1.2'" specifies which TensorFlow version is used for training. The first training run is a bit slower because of warm-up, but subsequent runs will be faster.
In [3]:
%%ml train --cloud
output: gs://datalab-mlworkbench-20newslab/train
analysis: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs_transformed
model_args:
  model: linear_classification
  top-n: 5
cloud_config:
  scale_tier: BASIC
  region: us-central1
  runtime_version: '1.2'
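While the job runs, you can also inspect it from the shell. A sketch using the Cloud SDK of the same era (assumes gcloud with the Cloud ML Engine component is installed and authenticated):

# List recent Cloud ML Engine training jobs and their states.
!gcloud ml-engine jobs list --limit=5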
In [4]:
# Once training is done, check the output.
!gsutil list gs://datalab-mlworkbench-20newslab/train
In [2]:
%%ml batch_predict
model: gs://datalab-mlworkbench-20newslab/train/evaluation_model
output: gs://datalab-mlworkbench-20newslab/prediction
format: csv
data:
  csv: gs://datalab-mlworkbench-20newslab/eval.csv
In [2]:
!gsutil ls gs://datalab-mlworkbench-20newslab/prediction/
In [5]:
%%ml evaluate confusion_matrix --plot
size: 15
csv: gs://datalab-mlworkbench-20newslab/prediction/predict_results_eval.csv
In [6]:
%%ml evaluate accuracy
csv: gs://datalab-mlworkbench-20newslab/prediction/predict_results_eval.csv
Out[6]:
In [7]:
%%ml predict
model: gs://datalab-mlworkbench-20newslab/train/model
data:
  - nasa
  - windows xp
In [8]:
%%ml model deploy
name: newsgroup.alpha
path: gs://datalab-mlworkbench-20newslab/train/model
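Once deployed, online prediction can target the deployed model instead of the GCS path. A sketch, assuming %%ml predict accepts --cloud with a deployed "model.version" name (mirroring the batch_predict --cloud cell below):

%%ml predict --cloud
model: newsgroup.alpha
data:
  - nasa
  - windows xp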
In [9]:
# Let's create a test CSV from eval.csv by removing the target (first) column.
with open('./data/eval.csv', 'r') as f, open('./data/test.csv', 'w') as fout:
    for l in f:
        # Split on the first comma only so commas inside the text are preserved.
        fout.write(l.split(',', 1)[1])
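A quick sanity check that test.csv now contains only the text column (assuming a Unix shell, as available in Datalab):

!head -n 2 ./data/test.csv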
In [12]:
!gsutil cp ./data/test.csv gs://datalab-mlworkbench-20newslab/test.csv
In [13]:
%%ml batch_predict --cloud
model: newsgroup.alpha
output: gs://datalab-mlworkbench-20newslab/test
format: json
data:
  csv: gs://datalab-mlworkbench-20newslab/test.csv
cloud_config:
  region: us-central1
Once the job is complete, take a look at the results.
In [14]:
!gsutil ls -lh gs://datalab-mlworkbench-20newslab/test
In [15]:
!gsutil cat gs://datalab-mlworkbench-20newslab/test/prediction.results* | head -n 2
In [16]:
%%ml model delete
name: newsgroup.alpha
Deleting the version leaves the model itself in place, so delete the "newsgroup" model as well.
In [ ]:
%%ml model delete
name: newsgroup
In [ ]:
# Delete the files in the GCS bucket, and delete the bucket
!gsutil -m rm -r gs://datalab-mlworkbench-20newslab
In [ ]: